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Abstract 

Regression analysis, for prediction purposes, with compositional data is the subject of this 
paper. We examine both cases when compositional data are either response or predictor vari¬ 
ables. A parametric model is assumed but the interest lies in the accuracy of the predicted 
values. For this reason, a data based power transformation is employed in both cases and the 
results are compared with the standard log-ratio approach. There are some interesting results 
and one advantage of the methods proposed here is the handling of the zero values. 
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1 Introduction 

Compositional data are positive multivariate data whose vector elements sum to the same constant 
usually taken to be 1 for convenience purposes. Data of this type arise in many disciplines, such as 
geology, ecology, archaeology, economics, geochemistry, biology, bioinformatics, political sciences 
and forensic sciences among others. In mathematical terms, their sample space, called simplex, is 
given by 


§ d = { (x 1 ,...,x D ) rj 


D 

Xj > 0, X x i = 1 
i=l 


(i) 


where D denotes the number of variables (better known as components) and d = D — 1. 


There are various techi 

variables. See for example 

liques 

Aitcb 

for re 

ison 

gression analysis with compositional data being the response 
2003) who used classical methods on a log-ratio transformed 

space and 

Stephens 

(1982 

) and 

Scealv and Welsh 

(2003 

) who transformed the data on the surface 


of a unit hype r-sp here usin g the sc 


employed by Gueorguieva et al. (2008|) and 


uare root transformation. Dirichlet regression models have been 
Maier (2011). 


When zero value s exist in d ata, Dirichlet m o dels a nd the log-ratio transformation suggested 
by lAitchison (|1982l . [2003|) and lEgozcne et al. (1200,11 ) will not work unless a zero value impu¬ 
tation js_agplied first. The square root transformation on the other hand treats them naturally 
(iScealv and Welsh 2003), but the procedure is not easy to implement. 
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We propose the use of a newly suggested data based power transformation .Tsagris et ah (2011) 
to perform regression analysis. The multivariate logit link function is necessary to ensure that the 
fitted values lie within the simplex. The free parameter of the power transformation is chosen such 
that that the discrepancy between the observed and the fitted values is minimized. This implies 
that the use of Kullback-Leibler divergence will be of great importance in achieving this goal. 

The big advantage of this type of regression is that a) zeros are handled naturally and thus no 
zero imputation technique prior to the analysis is necessary and b) more accurate fitted values can 
be obtained. The disadvantage on the other hand is that we loose in terms of statistical properties 
of the estimated coefficients. Hence, this is more a data mining procedure. 

The inverse situation, where compositional data are in the predictor variables side has not been 
looked at, thoroughly, in the literature as far as w e are aw are o f. One possible solution would be 


techniques, see for example lEgozcue et al. (|2011h and 


to apply the isometric log- ratio transforma t ion (lEgozcue et al. 1 20031) fs ee H]) and then standard 


Hr on et al. 


(2012). But this approach has 


two drawbacks: first it does not account for any collinearity between the components and second 
it is not applicable when zero values are present. 

Collinearity problems can be attacked by principal component regression. Zero values require 
imputation, so that the isometric log-ratio transformation can be applied. This however means 
that the values of the rest components of each vector, which contains at least one zero value, will 
have to change. In some data sets, there can be many zeros spread in many observations. That 
would mean that many observations would have to change values, even slightly. This adds extra 
variation to the data though. 

In this paper we propose a novel parametric regression method whos e ultim ate purpose is pre¬ 


diction inference. A data based power transformation (jTsagris et al. , 2011) involving one free 


parameter, a is employed at first. We suggest a way to choose the value of a which leads to the 
optimal results. The results show th at pred iction can be more accurate when one u ses a trans¬ 


formation other than the isometric (Egozcue et al. 
19821 ). 


2003) or the additive log-ratio (Aitchison 


A similar approach is exhibited when for the case of compositional data being predictor vari- 

(2012) who used the isometric log-ratio trans- 


Hron et al. 


ables. Thus, this extends the work by 
formation only. However, our work focuses mainly on prediction and not on inference regarding 
the regression coefficients. 

Alternatively, if the number of explanatory variables is rather small, standard linear regression 
analysis can be carried out. In either case, the a-transformation handles zero values in the data 
naturally (if present) and principal component or A;-NN regression handles possible collinearity 
problems. The key message is that a transformation other than the log-ratio family can lead to 
better results in terms of prediction accuracy. 

Regression analysis when compositional data are the response variables is described in Section 
2. Simulation studies and a real data example illustrates the methodology. Section 3 describes the 
principal component regression when compositional data are the predictor variables. A real data 
example shows interesting results. Finally, conclusions close this paper. 


2 The ct-transformation for compositional data 


The a-transformation was pro posed bylTsaeris et al. J (2011) as a more general than the isometric 


log-ratio transformation (Egozcue et al. , 2003) and is a data based power transformation involving 
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one free pow er para mete r, s imilarly to the Box-Cox transformation. The power transformation 
suggested by Aitchison (2003) is 


u = 


b D 


XS=^ 


£?=i x ? 


( 2 ) 


and in terms of that dehne 


z = -H(Du- j D ), 
a 


(3) 


where x £ S d , H is the d x D Helmert sub-matrix (the Helmert matrix Lancaster , 1965 without 
the first row) and jp is the .D-dimensional vector of Is. Note that the power transformed vector 
u in (|2]) remains in the simplex S d , whereas z is mapped onto a subset of R d . Note also that d3]) 
is simply a linear transformation of (l2l) and moreover as a —>• 0, ([3|) converges to the isometric 


log-ratio transformation (jEgozcue et al. I2003T ) defined as 


v = 


( ! * 

H log xi - — ^2 lo g ■ 
V 3=1 


'Ji 


..., log x D - 


i D V 

-^iogxyj . 


(4) 


We can clearly see that when there are zero values in the compositional data the isometric 
log-ratio transformation ([U is not applicable because the logarithm of zero is undefined. For this 
reason, we will also examine the zero value imputation briefly mentioned in the next Section. 


2.1 The cr-regression 

We will use the inverse of the additive logistic transformation, combined with the a-transformation, 
as a link function. This is a new regression using the a-transformation ([3]) which allows for more 
flexibility even in the presence of zero values. Another feature of this method is that the line is 
always curved (unless a is far away from zero) and so it can be seen not only as a generalization 
of the log-ratio regression but also as a flexible type compositional regression in the sense that the 
curvature of the line is chosen based on some discrepancy criteria, examined later. 

In order for the fitted values to satisfy the constraint imposed by the simplex we model the 
inverse of the additive logistic transformation of the mean response. Hence, the fitted values will 
always lie within S d and we also retain the flexibility the a-transformation offers. 

We assume that the conditional mean of the observed composition can be written as a non¬ 
linear function of some covariates 


W = 
hi = 


_j_ 

1+E?=i e XT 
e* T Pi 

1+E?=i e* T0j 


for i = 2,..., D, 


( 5 ) 


where 


Pi 


(A)i> filii ft pi) i i — 1) ■■ 


and p denotes the number of covariates. 
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Then a multivariate linear regression is applied to the a-transformed data 


l {a) = -^log 



1 r 

£ 

- tr 

2 L 


[Y a -M a )± a 1 (Y a -M a ) q 


(6) 


where Y a and M a are the a-transformed response and fitted compositional vectors. We have 
ignored the Jacobian determinant of the a-transformation since it plays no role in the optimization 
process and the choice of a For each value of a we maximize the value of this objective function 
The £ needs not be numerically est im ated, since B, the matrix of the estimates and £ 


(Mardia et ah , 1979) 


are statistically indep endent (jMardia et al. 119791 ). The maximum likelihood estimator of £ is 


£„ = n-^PY,, 

where P = I n — X (X 7 X) 1 X 7 . But since this covariance is not unbiased we will use the unbiased 
estimator 


= (n -p - 1) 


_1 y t PY 

1 a ^ 1 ai 


where X is the design matrix and p is the number of independent variables. 

The consistency of the estimators of the parameters is not an issue in our case since we focus 
on prediction inference. Since the estimation of the parameters depends upon the value of a, 
the estimates will not be consistent, unless that is the true assumed model. The multivariate 
normal is defined in the whole of R 7 but the a-transformation maps the data onto a subset of M 7 . 
Thus, unless there is not too much probability left outside the simplex, the multivariate normal 
distribution might not be the best option. 

The a-transformation what it does essentially is to contract the simplex, cente r it to the origin 
and then project it on a subspace of R 7 by using the Helmert sub-matrix (Lancaster , 1965). So 
if the fitted multivariate normal has high dispersion that will lead to pro bability left outside the 
simplex. The multivariate t distribution was used by Lange et al. (1989) as a more robust, in 
comparison to the multivariate normal, model but even so, it will not be the best option, mainly 
for two reasons. Even if the multivariate t distribution could provide flatter tails, there would still 
be some probability (even less than the normal) left outside the simplex. Secondly, in a regression 
setting, the number of parameters we would have to estimate numerically is increased and this 
would make the maximization process more difficult. 

A final key feature we have to note is that when a —> 0 we end up with the additive log-ratio 
regression ([9]). 


2.1.1 Choosing the optimal a using the Kullback-Leibler divergence 

The disadvantage of the profile log-likelihood, for choosing the value of a, is that it does not allow 
zeros. On the other hand, it provides the maximum likelihood estimates which are asymptotically 
normal. Furthermore, confidence intervals for the true value of a can be constructed. 

We suggest an alternative and perhaps better way of choosing the value of a. Better in the 
sense that it is trying to take into account the proximity between the observed and the fitted 
values. The criterion is to choose the a which minimizes twice the Kullback-Leibler divergence 
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(Kullback , 1997) 


n D 

KL = Vij lo S 

j =1 i =1 


Vij 

Vij 


(7) 


where ijij is the observed compositional point and yij is the corresponding fitted value. The form 
of the deviance for the log-linear models and the logistic regression has the same expression as 
well. Hence, we transfer the same form of divergence to compositional data. For every value of a 
we estimate the parameters of the regression and choose the value of a which minimizes ©. 

The number 2 is there because in the case of D = 2 we end up with the log-likelihood of the 
binary logistic regression. The Kullback-Leibler divergence (J7|) takes into account the divergence 
or the distance of each of the observed values from the fitted values. 

Since we are interested in prediction analysis we should use cross-validation to choose the value 
of a. The reason why we did not choose to go down this road is because of time. The estimation 
of the parameters for a single value of a requires some seconds in a fine desktop. The search 
over many possible values of a requires a few minutes. Thus, even a 1-fold cross validation would 
require a few hours and as the number of independent variables, and/or the sample size increase 
the computational time will increase as well. So, for the shake of speed we avoided this way. 


2.2 Additive log-ratio regression 

The additive log-ratio transformation is defined as 

Zi = log (—\ for i = l,...,d, (8) 

VW 

where d = D — 1, yp is the last component playing the role of the common divisor, but by 
relabelling any component can play this role. We will now show the additive log-ratio regression. 
At first we will see a nice property of the logarithms and its implications on the additive log-ratio 
regression. 

log (—) = x T /Jj log yi = log y D + x T /3j, i = l,...,d (9) 

\VdJ 

where x 7 is a column vector of the design matrix X, D is the number of components and 

Pi = {Poii Plii •••■> Ppi) i i = 1) d 


are the regression coefficients and p is the number of independent variables. 

We see from Q that when the dependent variable is the logarithm of any component, the 
logarithm of the common divisor component can be treated as an offset variable; an independent 
variable with coefficient equal to 1. 

The main disadvantage of this type of regression is that it does not allo w ze ro values in any 
of the components, unless a zero value imputation technique Martin et al- (2012) is applied first. 
Its advantage though is that after the additive log-ratio transformation Q we can do the standard 
multivariate regression analysis. As for the fitted value they are back transformed into the simplex 
using the inverse of (0). 
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2.3 Example: Foraminiferal compositions with some zero values 



depth as the independent variable. Since the data contain zero values, the cc-regression allows only 
strictly positive values for a. Since the composition belongs to S 3 we cannot use the ternary plot; 
we could though use a 3-dimensional pyramid, but it would not show the structure of the data 
clearly. For this reason we will use a barplot. 



0 1.1 1.79 2.2 2.48 2.71 2.89 3 3.09 3.22 3.33 


■ Triloba ■ Obesa ■ Pachyderma ■ Atlantica 


Figure 1: The foraminiferal compositions as a function of the logarithm of the sea water depth. 


A zero value imputation method suggested by Martin et al. (2012) can be briefly summarised 
as follows: Repla ce t he z_ero values by 65% of a threshold value using the non parametric method 


described by (Martin et al. , 2003). The threshold value is different for each component. We used 


the minimum, non zero, value of each component as that threshold. Then the isometric log-ratio 
transformation ([5]) is applied. An E-M algorithm substitutes the replaced values with a value less 
than the chosen threshold. In the end, the data are transformed back to the simplex. This method, 
like all zero value imputation methods, results in all the components of each compositional vector 
being changed, even slightly. A tolerance value (say 0.01) is used as convergence criterion between 
successive iterations. 

We will use this method in order to allow for the additive log-ratio regression to be performed. 
In this case the Kullback-Leibler divergence ([7]) will be calculated for the initial, not the zero 
imputed data. 

The value of the divergence © when regression is applied to the original data with a = 1 is 
equal to 6.123. The value of the divergence © when regression is applied to the zero imputed 
data data with a = 0 is equal to 7.112 and when a = 1 is equal to 6.123. Thus a, value of a 


other than zero leads to better predictions with without zero value imputation. Scealv and Welsh 


([2003) used the same dataset treating the compositional data as directional by applying the square 
root transformation. Their suggested Kent regression (no zero imputation is necessary) produced 
a divergence value equal to 6.344, still lower than the additive log-ratio regression. 

As earlier stated, the purpose of this regression is to provide better fittings to the observed 
data. It is more a data mining procedure than it is a statistical one, from the point of view that 
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(a) (b) 

Figure 2: Twice the Kullback-Leibler divergence © as a function of a. Graph (a) refers to the 
original data and graph (b) refers to the zero imputed data. 


we are interested in estimating/predicting future observations with as high accuracy as possible. 
We are not interested in making inference about the regression coefficients. 

In the next Section, the reverse scenario is of interest, that of the compositional data (containing 
zero values again) playing the role of the covariates. 


3 Principal component regression with compositional data being 
the predictor variables 

As the title of this Section suggests we will focus only in the principal component regression, even 
though we could_mention regression analysis in general. The reason for not doing so is because 


Hron et ah (2013) has already covered this case and we are interested in generalising this idea to 


cover the case of multicollinearity as well. 

Principal component regression is based on principal component analysis and hence we will 
not spend too much time. The algorithm to perform principal component regression (PCR) can 
be described as follows 

1. At first standardize the independent variables. This way, X 7 X (the n x p design matrix, 
which includes the p independent variables but n ot t he intercept term) is proportional to the 
correlation matrix for the predictor variables (Jolliffe , 2005). The n stands for the sample 


size. 

2. Perform eigen analysis on X T X and calculate the matrix of the eigenvectors V and the scores 

Z = XV. 

3. Estimate the regression coefficients by 

B = V (Z T Z) _1 Z T y, 


where y is the vector containing the values of the dependent variable. 
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4. Estimate the covariance matrix of the estimated regression coefficients by 

Var (b) = ct 2 V (Z t Z) _ 1 V t , 

where a 2 is the conditional variance of the dependent variable calculated from the classical 
multiple regression analysis based upon the given number of principal components. It is the 
error variance, whose estimate is the (unbiased) mean squared error. 

The key point is that we can have p different sets of estimated regression coefficients, since we 
can use the first eigenvector (or principal component), the first two eigenvectors or all of them. If 
we use all p of them, then we end up with the same regression coefficients as if we performed a 
classical multiple regression analysis. 

Since we are not interested in statistical inference regarding the coefficients of the principal 
components we can ignore their variance matrix. Another key point that is worthy to mention 
is that we can also use the principal scores as the new predictor variables and do the classical 
regression. The fitted values will be exactly the same. 

The idea here is simple, apply the a-transformation ([3]) to the compositional data and then 
perform principal component regression. The value of a and the number of principal components 
which lead to the optimal results are chosen via cross validation. The optimal results in our case 
will refer to minimization of the mean squared error of prediction. 

3.1 Example: Glass data and refractive index (zero values present) 

In optics the refractive index of a substance (optical medium) is a ’’clean” number that describes 
how light, or any other radiation, propagates through that medium. It is defined as RI = A where 
c is the speed of light in vacuum and v is the speed of light in the substance. The refractive index 
of water, for example, is 1.33. This means that the speed of light in water is reduced by 24.8% 

[(-*■ ~ rh) %]■ 

Surprisingly enough, negative refractive values are also possible, when if permittivity and per¬ 
meability have simultaneous negative values. This can be achieved with periodically constructed 
metamaterials. The negative refraction index (a reversal of Snell’s law) offers the possibility of the 
superlens and other exotic phenomena. 

The glass dataset available from UC Irvine Machine Learning Repository contains 214 obser¬ 
vations regarding glass material from crime scenes, with information about 8 chemical elements, 
in percentage form. These chemical elements are sodium (Na), Magnesium (Mg), aluminium (Al), 
silicon (Si), potassium (K), calcium (Ca), barium (Ba) and iron (Fe). The variable whose values 
we wish to predict based on knowledge of the chemical composition is the refractive index (RI) of 
each glass. 

The category of glass is also available (for example vehicle headlamps, vehicle window glass, 
tableware et cetera). This extra information can be taken into account by principal component 
regression easily. For the kernel regression though, an extra kernel for this discrete variable has to 
be fitted and the final kernel is the product of the two kernels (one for the continuous variables and 
one for this discrete variable). So, the inclusion of non continuous or different types of continuous 
data is not an easy task when kernel regression is to be used. 

We have also included a second PCR model which includes the extra information about the 
data, the category of the glass. The difference from the PCR described is that now we will 
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use the scores of the principal component analysis (see the second key point in Section [3]) to do 
regression. In this way we will also add the categorical variable indicating the category of each 
glass measurement. Table [T] summarizes the results of the five regression models. 


Regression model for the original data 


Method 

Chosen parameters 

MSPE 

Adjusted R J 

PCR 

a = 1 and 7 PCs 

1.237 

0.891 

PCR with the glass 



category 

a = 1 and 7 PCs 

1.02 

0.903 

Regression model for the zero value imputed data 

Method 

Chosen parameters 

MSPE 

Adjusted R 2 

PCR 

a = 0 and 7 PCs 

2.403 

0.784 

PCR 

a = 0.95 and 7 PCs 

1.239 

0.890 

PCR with the glass 



category 

a = 1 and 6 PCs 

1.016 

0.863 


Table 1: Mean squared error of prediction (MSPE) for each regression model applied to the glass 
data. The adjusted coefficient of determination (R 2 ) is calculated using the fitted values from 
principal component regression. 


We can see from Table [T] that even if we replace the zero values the result is the same. The 
isometric log-ratio transformation when applied to the zero imputed data does not lead to the 
optimal results. A value of a far from zero led to much better results, the adjusted R 2 increased 
from 0.772 to 0.89 and the mean squared error of prediction was less than half. iMa rtin ct ah 


(2012) also used a robust E-M algorithm by employing MM estimators (Maronna et ah , 2006). 


The results (not presented here) when robust imputation was performed were slightly worse than 
the ones presented in Table (jT]). 

Even if we impute the zero values, the estimated MSPE, when the isometric log-ratio trans¬ 
formation is used (a = 0), is much higher than with a value of a close to 1 and the value of the 
adjusted R 2 is lower. 



Figure 3: Observed versus estimated refractive indices. Plots (a) and (b) correspond to a = 1 
with 7 PCs without and with information about the glass category, using in the original data. The 
third plot (c) corresponds to a = 0 with 7 PCs using the zero imputed data. 


We also used kernel regression and k -NN regression using many different metrics or distances 
with and without the a-transformation but none of them was as successful as the principal com¬ 
ponent regression. We also tried robust principal component regression for the second example 
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but the results were not better either. 

The best model is the one using all 7 principal components coming from the a-transformed 
data with a = 1 and using the information about the category of glass as we can see in Table [1] 
In mathematical terms it is written as 

RI = 1.345 - O.O 8 IS 1 + 1.3685 2 + O.I 6 IS 3 - O. 8 OIS 4 + 2.5335 5 - 1.6185 6 - 1.329SV 
— 1.567WinF - 1.460WinNF - 2.445Veh - 1.165Con - 1.177Tabl, 

where S',;, for i = 1,... ,7 stands for the scores of each principal component and WinF and WinNF 
stand for the window float and window non-float glass respectively. Veh stands for vehicle window 
glass, Con for containers, Tabl for tableware. The vehicle headlamps is the reference glass category. 
We do not show the standard errors since the choice of the number of principal components, and 
the inclusion of the information about the glass categories was based on the MSPE (Table [Tj). 

The normality of the residuals was rejected according to the Shapiro test (p-value< 0.001). As 
for the independence assumption by looking at Figured we can see that it looks acceptable. There 
are a few outliers in the residuals space as we can see from Figure [4) The a used in this model 
was equal to 1 as we mentioned before. Different values of a will lead to different values bering 
detected as potential outliers. In addition, we detected some outliers in the residuals space and 
not in the space spanned by the 7 principal components. 



Fitted values 


Figure 4: Fitted values versus the standardised residuals for the model. The two horizontal lines 
indicate the region of outliers in the residuals space. 


4 Conclusions 

There are two key messages this paper tries to convey. The first is that a transformation other 
than the isometric of the additive log-ratio transformation should be considered for compositional 
data analysis. It was evident from the example that when the data contain zero values the a- 
transformation handles the data naturally without changing their values even to the slightest. The 
log-ratio transformations require zero value imputation prior to their applications, a requirement 
not met by the a-transformation. 

One can argue though that in both examples values of a close to 1 produced similar results 
as a = 1. This was something to be expected for the data in the first example, but even then, 
this discourages the use of log-ratio as the panacea for all compositional data analysis. We do not 
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disqualify though the use of log-ratio but rather try to show that this very popular transformation 
for compositional data has some drawbacks and in some cases other transformations could be 
applied. In both examples, the differences in the Kulback-Leibler divergence or the MSPE were 
small, but still indicative. 

The second key message is that when compositional data analysis are on the covariates side 
standard regression techniques after the a-transformation should be used with caution. Since 
the data represent percentage allocation, it is very natural that correlations exist even after the 
transformation and thus, multicollinearity should be examined first. In addition, if there are many 
components then variable selection could be applied. These are some reasons why we preferred to 
perform principal component regression. 
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